An analysis of the Covid-19 Pandemic

Introduction

COVID-19 is the disease caused by SARS-CoV-2, the coronavirus that emerged in December 2019. COVID-19 can be severe, and has caused millions of deaths around the world as well as lasting health problems in some who have survived the illness. The coronavirus can be spread from person to person. It is diagnosed with a test.

Two years after the break out of the corona virus, the growth of COVID comfirmation rate seems to slow down, providing us the best timing to examine the pendemic as a whole. Here We would like to look at COVID in United states, and Maryland in specific, and discuss what the data illustrates us.

1. Data Collection

The data collection stage is very important. Without proper data to work with, no analysis can be done. Make sure to find credible and recent data to create accurate models and analysis.

In this project the Covid-19 data we used comes from Johns Hopkins University and is available at this link: https://github.com/CSSEGISandData/COVID-19

1.1 Tool used

We used the following tools to collect this data: pandas, numpy, matplotlib, scikit-learn, seaborn, and more.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import warnings
import os
import folium
warnings.filterwarnings('ignore')

1.2 Data processing

In [ ]:
world = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv", sep=',')
world.drop(columns = 'Province/State', inplace = True)
world
Out[ ]:
Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 ... 12/2/22 12/3/22 12/4/22 12/5/22 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22
0 Afghanistan 33.939110 67.709953 0 0 0 0 0 0 0 ... 206133 206145 206206 206273 206331 206414 206465 206504 206543 206603
1 Albania 41.153300 20.168300 0 0 0 0 0 0 0 ... 333381 333391 333408 333413 333455 333472 333490 333491 333521 333533
2 Algeria 28.033900 1.659600 0 0 0 0 0 0 0 ... 271100 271102 271107 271113 271122 271128 271135 271140 271146 271146
3 Andorra 42.506300 1.521800 0 0 0 0 0 0 0 ... 47219 47219 47219 47219 47219 47446 47446 47446 47446 47446
4 Angola -11.202700 17.873900 0 0 0 0 0 0 0 ... 104676 104676 104676 104750 104750 104808 104808 104808 104808 104808
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
284 West Bank and Gaza 31.952200 35.233200 0 0 0 0 0 0 0 ... 703036 703036 703036 703036 703036 703036 703036 703036 703036 703036
285 Winter Olympics 2022 39.904200 116.407400 0 0 0 0 0 0 0 ... 535 535 535 535 535 535 535 535 535 535
286 Yemen 15.552727 48.516388 0 0 0 0 0 0 0 ... 11945 11945 11945 11945 11945 11945 11945 11945 11945 11945
287 Zambia -13.133897 27.849332 0 0 0 0 0 0 0 ... 333746 333746 333746 333746 333746 333746 333746 333746 333746 333746
288 Zimbabwe -19.015438 29.154857 0 0 0 0 0 0 0 ... 259164 259164 259164 259164 259356 259356 259356 259356 259356 259356

289 rows × 1058 columns

Confirmed cases in each state of the United States

In [ ]:
us = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv", sep=',')
MD = us[us["Province_State"] == "Maryland"]
frames = [MD]
data = MD.drop(us.columns[0:11], axis=1)
data = data.append(data.sum(numeric_only=True), ignore_index=True)
data.drop(data.index[0:26], inplace=True)
list1 = ["New York", "Florida", "Nebraska", "Kansas", "Washington", "California"]
for x in list1:
  state = us[us["Province_State"] == x]
  frames.append(state)
  time = state.drop(state.columns[0:11], axis=1)
  sum = time.append(time.sum(numeric_only=True), ignore_index=True)
  data = data.append(sum.sum(numeric_only=True), ignore_index=True)
data = data.rename(index={0: 'Maryland', 1: 'New York', 2: 'Florida', 3: 'Nebraska', 4: 'Kansas', 5: 'Washington', 6: 'California'})
result = pd.concat(frames)
data
Out[ ]:
1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 ... 12/2/22 12/3/22 12/4/22 12/5/22 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22
Maryland 0 0 0 0 0 0 0 0 0 0 ... 1290858 1290858 1290858 1291854 1294038 1294916 1295950 1297092 1297092 1297092
New York 0 0 0 0 0 0 0 0 0 0 ... 12795650 12802248 12806764 12827458 12838722 12850488 12867048 12881668 12887456 12893878
Florida 0 0 0 0 0 0 0 0 0 0 ... 14540820 14540820 14540820 14540820 14540820 14540820 14540820 14540820 14540820 14540820
Nebraska 0 0 0 0 0 0 0 0 0 0 ... 1086270 1086270 1086270 1086270 1086270 1086270 1093506 1093506 1093506 1093506
Kansas 0 0 0 0 0 0 0 0 0 0 ... 1804072 1804072 1804072 1804236 1804236 1812612 1812612 1812612 1812612 1812612
Washington 2 2 2 2 2 2 2 2 2 2 ... 3719716 3719716 3719716 3719716 3719716 3732686 3732686 3732686 3732686 3732686
California 0 0 0 0 4 4 4 4 4 6 ... 23086862 23086862 23086862 23106942 23189956 23200170 23233140 23246084 23246084 23246084

7 rows × 1055 columns

Number of deaths by state in the US

In [ ]:
us_dead = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv", sep=',')
MD2 = us_dead[us_dead["Province_State"] == "Maryland"]
data2 = MD2.drop(MD2.columns[0:12], axis=1)
data2 = data2.append(data2.sum(numeric_only=True), ignore_index=True)
data2.drop(data2.index[0:26], inplace=True)
for x in list1:
  state2 = us_dead[us_dead["Province_State"] == x]
  time2 = state2.drop(state2.columns[0:12], axis=1)
  sum2 = time2.append(time2.sum(numeric_only=True), ignore_index=True)
  data2 = data2.append(sum2.sum(numeric_only=True), ignore_index=True)
data2 = data2.rename(index={0: 'Maryland', 1: 'New York', 2: 'Florida', 3: 'Nebraska', 4: 'Kansas', 5: 'Washington', 6: 'California'})
data2
Out[ ]:
1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 ... 12/2/22 12/3/22 12/4/22 12/5/22 12/6/22 12/7/22 12/8/22 12/9/22 12/10/22 12/11/22
Maryland 0 0 0 0 0 0 0 0 0 0 ... 15748 15748 15748 15756 15775 15785 15792 15800 15800 15800
New York 0 0 0 0 0 0 0 0 0 0 ... 148100 148100 148100 148180 148332 148434 148504 148566 148566 148566
Florida 0 0 0 0 0 0 0 0 0 0 ... 166402 166402 166402 166402 166402 166402 166402 166402 166402 166402
Nebraska 0 0 0 0 0 0 0 0 0 0 ... 9326 9326 9326 9326 9326 9326 9344 9344 9344 9344
Kansas 0 0 0 0 0 0 0 0 0 0 ... 19358 19358 19358 19358 19358 19384 19384 19384 19384 19384
Washington 0 0 0 0 0 0 0 0 0 0 ... 29410 29410 29410 29410 29410 29658 29658 29658 29658 29658
California 0 0 0 0 0 0 0 0 0 0 ... 195234 195234 195234 195290 195310 195364 195634 195668 195668 195668

7 rows × 1055 columns

2. Data management/representation 图表

确诊趋势

In [ ]:
data = data.swapaxes("index", "columns")
data2 = data2.swapaxes("index", "columns")
In [ ]:
data.plot()
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff49a7c6340>

死亡数趋势

In [ ]:
data2.plot()
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff499d3afa0>

画个地图 no yes

In [ ]:
result = result.drop(us.columns[[0,1,2,3,4,7,10]], axis=1)
melted = pd.melt(result, ['Admin2','Province_State', 'Lat', 'Long_'], var_name="Date", value_name='Cases')
melted = melted.drop(columns=['Province_State'])
melted = melted.rename(columns={'Admin2': 'Admin', 'Long_': 'Long'})
melted["Date"] = pd.to_datetime(melted['Date'])
melted = melted.groupby(['Admin', 'Date']).sum()
melted["Next_day"] = melted['Cases'].shift(fill_value=0)
melted["Daily_change"]= melted['Cases'] - melted['Next_day']
melted = melted.drop(columns=['Next_day'])
melted = melted.reset_index()
melted = melted[melted["Daily_change"] >= 0]
In [ ]:
import plotly.express as px
df = melted
df["Date"] = df["Date"].astype(str)
fig = px.scatter_geo(df, lat="Lat", lon="Long",
                     hover_name="Admin", size="Cases",size_max=80,
                     animation_frame="Date",
                     scope = "usa",
                     title = "Total Cases")
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 100
fig.show()

3. Exploratory data analysis

4. Hypothesis testing

5. Communication of insights attained

Motivation: each tutorial should be sufficiently motivated. If there is not motivation for the analysis, why would we ‘do data science’ on this topic?

Understanding: the reader of the tutorial should walk away with some new understanding of the topic at hand. If it’s not possible for a reader to state ‘what they learned’ from reading your tutorial, then why do the analysis?

Resources: tutorials should help the reader learn a skill, but they should also provide a launching pad for the reader to further develop that skill. The tutorial should link to additional resources wherever appropriate, so that a well-motivated reader can read further on techniques that have been used in the tutorial.

Prose: it’s very easy to write the literal English for what the Python code is doing, but that’s not very useful. The prose should enhance, the tutorial, adding additional context and insight.

Code: code should be clear and commented. Function definitions should be described and given context/motivation. If the prose helps the reader understand why you’ve written the code, the comments in the code should be sufficient for the reader to learn how.

Pipeline: all stages of the pipeline should be discussed. We will be looking for ‘good science’, with discussion of each stage and what it’s implications/consequences are.

Communication of Approach: every technical choice has alternatives, why did you choose the approach taken in the tutorial? A reader should walk away with some idea of what the trade-offs may be.

Formatting and Subjective Evaluation: does the tutorial seem polished and ‘publishable’, or haphazard and quickly thrown together? The tutorials should read as well put-together and having undergone a few iterations of editing and refinement. This should be the easiest of the dimensions.